X1E SIG report
Chairs: Mark Fahey (ORNL) and Jim Glidewell (Boeing)
During Cray User Group 2007, the X1E Special Interest Group (SIG) meeting was held on Wednesday, May 9, at 4:00 pm. Nineteen people attended, including members from Boeing, Cray, ORNL, Rutgers University, HLRS, SMDC and NASA Ames. Mark Fahey ran the combined X1E User and System SIG. Cray attendance (8) was excellent.
After a brief introduction by the Chair, Mark Fahey of ORNL, Cray was given the opportunity to make announcements and provide comments. First, Steve Johnson of Cray noted that the X1 line is essentially in maintenance mode, so no new development is being done. However, he made it clear that Cray still supports the machine and that, if any critical problems arise, Cray will deal with them and provide fixes. He also noted that Cray has released kernel bug fixes over the past year. Then Luiz DeRose of Cray described where the programming environment stands and where it is going: PE 5.6 had just been released, and PE 6.0 is coming and will be the last release for the X1 line. This announcement raised the first issue: at least two sites each have at least one application that requires PE 5.4.7. Luiz asked why there were no SPRs on either case. It turned out that one site had not filed an SPR and was strongly encouraged to open one; the other site does have an SPR, which is being actively worked, and work is under way to verify that PE 5.6 works for the case that requires 5.4.7.
The session was then opened for comments and issues. Jim Glidewell (Boeing) expressed his dismay at salesforce.com as a service portal and his continuing disappointment with how submitting SPRs and getting information from CRInform works. This comment matched at least a few others’ opinions from the 1 on 100 session and the XT SIG session. Steve Johnson commented that work was in progress to provide a common problem-tracking tool to be used by both customers and Cray employees. Mark Fahey (ORNL) followed this somewhat lengthy discussion with a complaint about how long some of the SPRs he has submitted have been in the CRInform system. For example, he said that he has 4 SPRs on PE issues dating back to 2004. His particular complaint was the very long time these SPRs have spent in the system; he wanted better monitoring of tickets, and perhaps an increase in an SPR's priority as it ages.
Jim Glidewell suggested that Cray think hard about automated processes or other OS hooks that make gathering useful troubleshooting information as easy as possible, since larger system sizes often make problems difficult to reproduce and diagnose.
The chair then asked for more issues and, when none were raised, showed a slide listing several issues (originally presented at the last CUG). These included the need for better integration of psched and PBS, the need for a more efficient migrator, the desire for better “psview -m” output, and the ability to generate only 1 core file (as happens on the XT line). Since the OS has been in maintenance mode for the last year, there was no expectation that these would change, but the chair still wanted to raise them. Each generated a healthy discussion, and Cray was going to look into how it might deal with some of them.
The meeting ended with a short discussion about the future of CUG SIGs, which are likely to be reorganized. The audience was asked for suggestions on the future organization and for volunteers who would like to be involved, but little input was provided by customer sites. Jim Glidewell suggested that the X1 SIG would likely persist for another year in its current form, so informal elections were held for next year's chair and deputy. As a result, Jim Glidewell is the new Chair (replacing Mark Fahey) and Rolf Rabenseifner is the deputy.
It was felt that there should still be a single X1 SIG, since the platform will remain in the field for some time to come and many X1 issues are unique to that platform. There was some consensus that splitting a “product” SIG into parallel “user” and “system” sessions is a problem because many attendees have interests in both. Steven Johnson noted that past CUGs had “question and answer” sessions for both hardware and software, and that we may want to revive this session type.